BERT: Pre-training of Deep Bidirectional Transformers#


  • The year 2018 marked a turning point for the field of Natural Language Processing (NLP).

  • The BERT [Devlin et al., 2018] paper introduced a new language representation model that outperformed all previous models on a wide range of NLP tasks.

  • BERT is a deep bidirectional transformer model that is pre-trained on a large corpus of unlabeled text.

  • The model is trained to predict masked words in a sentence and is also trained to predict the next sentence in a sequence of sentences.

  • The pre-trained model can then be fine-tuned on a variety of downstream NLP tasks with state-of-the-art results.

BERT builds on two key ideas: the transformer architecture and unsupervised pre-training.

BERT is pre-trained on a large corpus of unlabeled text. Its weights are learned through two self-supervised tasks: masked language modeling (predicting randomly masked words in a sentence from their left and right context) and next-sentence prediction (deciding whether one sentence actually follows another).
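The masked-word objective selects about 15% of input tokens for prediction; of those, the BERT paper's recipe replaces 80% with [MASK], swaps 10% for a random token, and leaves 10% unchanged. A minimal sketch of that selection rule (the sentence and vocabulary here are illustrative, not BERT's actual WordPiece vocabulary):

```python
import random

def mask_tokens(tokens, vocab, p_select=0.15, rng=None):
    """Apply BERT's masking recipe: select ~15% of tokens; of those,
    80% become [MASK], 10% become a random token, 10% stay unchanged.
    Returns the corrupted tokens and the positions the model must predict."""
    rng = rng or random.Random(0)
    corrupted, targets = list(tokens), []
    for i, tok in enumerate(tokens):
        if rng.random() < p_select:
            targets.append(i)
            r = rng.random()
            if r < 0.8:
                corrupted[i] = "[MASK]"
            elif r < 0.9:
                corrupted[i] = rng.choice(vocab)
            # else: leave the token unchanged (the model still predicts it)
    return corrupted, targets

vocab = ["the", "cat", "sat", "on", "mat", "dog", "ran"]
tokens = ["the", "cat", "sat", "on", "the", "mat"]
corrupted, targets = mask_tokens(tokens, vocab)
print(corrupted, targets)
```

Because 10% of the selected tokens are left unchanged, the model cannot rely on [MASK] being the only position it must reconstruct.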

BERT is a (multi-headed) beast#

BERT is a deep bidirectional transformer model. It is a multi-headed beast with 12 (24) layers, 12 (16) attention heads per layer, and 110 (340) million parameters, where the parenthesized numbers refer to BERT-Large and the others to BERT-Base. Since model weights are not shared across layers, the total number of distinct attention heads is 12 × 12 = 144 (24 × 16 = 384).
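The head counts above can be tallied directly (layer and head sizes as published for the two BERT variants):

```python
# Layer and per-layer head counts for the two published BERT sizes.
bert_sizes = {
    "BERT-Base":  {"layers": 12, "heads_per_layer": 12},
    "BERT-Large": {"layers": 24, "heads_per_layer": 16},
}

for name, cfg in bert_sizes.items():
    # Weights are not shared across layers, so every (layer, head) pair is distinct.
    total = cfg["layers"] * cfg["heads_per_layer"]
    print(f"{name}: {total} distinct attention heads")
# BERT-Base: 144 distinct attention heads
# BERT-Large: 384 distinct attention heads
```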

Visualizing BERT#

Because of BERT’s complexity, it is difficult to understand the meaning of its learned weights intuitively. To help with this, we can visualize the attention weights of BERT’s self-attention layers.

%pip install bertviz
%config InlineBackend.figure_format='retina'

from bertviz import model_view, head_view
from transformers import AutoTokenizer, AutoModel, utils

utils.logging.set_verbosity_error()  # Suppress standard warnings
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)
inputs = tokenizer.encode("The cat sat on the mat", return_tensors='pt')
outputs = model(inputs)
attention = outputs[-1]  # Output includes attention weights when output_attentions=True
tokens = tokenizer.convert_ids_to_tokens(inputs[0]) 
head_view(attention, tokens)
  • The tool visualizes attention as lines connecting the position being updated (left) with the position being attended to (right).

  • Colors identify the corresponding attention head(s), while line thickness reflects the attention score.

  • At the top of the visualization, you can select the model layer and the attention head(s) to visualize.

What does BERT actually learn?#

Let’s explore the attention patterns of various layers of BERT (the BERT-Base, uncased version).

Sentence A: I went to the store.

Sentence B: At the store, I bought fresh strawberries.

BERT uses WordPiece tokenization and inserts special classifier ([CLS]) and separator ([SEP]) tokens, so the actual input sequence is:

[CLS] I went to the store . [SEP] At the store , I bought fresh straw ##berries . [SEP]
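WordPiece splits an out-of-vocabulary word such as strawberries into the longest subwords it knows, prefixing word-internal pieces with ##. A greedy longest-match-first sketch over a toy vocabulary (a simplification: the real tokenizer also handles casing and punctuation, and uses a vocabulary of roughly 30,000 pieces):

```python
def wordpiece(word, vocab):
    """Greedy longest-match-first subword split, as in BERT's WordPiece.
    Word-internal pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = "##" + piece
            if piece in vocab:
                pieces.append(piece)
                break
            end -= 1
        else:
            return ["[UNK]"]  # no known piece matches from this position
        start = end
    return pieces

toy_vocab = {"straw", "##berries", "store", "went", "i"}
print(wordpiece("strawberries", toy_vocab))  # ['straw', '##berries']
print(wordpiece("store", toy_vocab))         # ['store']
```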

inputs = tokenizer.encode(
    "I went to the store.",
    "At the store, I bought fresh strawberries.",  # passed as text_pair, not as a list
    return_tensors="pt",
)
outputs = model(inputs)
attention = outputs[-1]
tokens = tokenizer.convert_ids_to_tokens(inputs[0])

Pattern 1: Attention to next word#

Select layer 2, head 0. (The selected head is indicated by the highlighted square in the color bar at the top.) Most of the attention at a particular position is directed to the next token in the sequence.

  • If you do not select any token, the visualization shows the attention pattern for all tokens in the sequence.

  • If you select a token, the visualization shows the attention pattern for the selected token.

  • If you select the token i, virtually all the attention is directed to the next token, went.

  • The [SEP] token disrupts the next-token attention pattern, as most of the attention from [SEP] is directed to [CLS] (the first token in the sequence) rather than the next token.

  • This pattern, attention to the next token, appears to work primarily within a sentence.

  • This pattern is related to the backward direction of a recurrent neural network (RNN): each position gathers information from the token that follows it, just as a backward RNN propagates state from right to left.

head_view(attention, tokens, layer=2, heads=[0])
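One way to check this pattern quantitatively is to measure how much of each row's attention mass lands on the position one step ahead. The helper below works on a single head's attention matrix as a NumPy array; with the real model above you would pass, e.g., attention[2][0, 0].detach().numpy(), while here the matrix is synthetic. The same function with offset=-1 measures the previous-word pattern discussed next:

```python
import numpy as np

def offset_attention_fraction(att, offset=1):
    """Mean attention weight that each position assigns to the position
    `offset` steps away (+1 = next token, -1 = previous token).
    `att` is a (seq_len, seq_len) matrix whose rows sum to 1."""
    n = att.shape[0]
    rows = [i for i in range(n) if 0 <= i + offset < n]
    return float(np.mean([att[i, i + offset] for i in rows]))

# Synthetic head that puts 90% of its mass on the next token.
n = 6
att = np.full((n, n), 0.1 / n)
for i in range(n - 1):
    att[i, i + 1] += 0.9
att /= att.sum(axis=1, keepdims=True)  # renormalize rows

print(round(offset_attention_fraction(att, offset=1), 2))  # 0.92
```

A value near 1.0 means the head behaves almost exactly like a "copy the next token" operation; a uniform head would score 1/seq_len.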

Pattern 2: Attention to previous word#

Select layer 6, head 11. In this pattern, much of the attention is directed to the previous token in the sequence.

  • For example, most of the attention from went is directed to the previous token i.

  • The pattern is not as distinct as the next-token pattern, but it is still present.

  • Some attention is also dispersed to other tokens in the sequence, especially to the [SEP] token.

  • This pattern is also related to the idea of an RNN, in this case the forward direction of an RNN.

head_view(attention, tokens, layer=6, heads=[11])

Pattern 5: Attention to other words predictive of word#

Select layer 2, head 1. In this pattern, attention seems to be directed to other words that are predictive of the source word, excluding the source word itself.

  • For example, most of the attention from straw is directed to ##berries, and most of the attention from ##berries is focused on straw.

head_view(attention, tokens, layer=2, heads=[1])

Pattern 6: Attention to delimiter tokens#

Select layer 6, head 4. In this pattern, attention is directed to the delimiter tokens, [CLS] and [SEP].

  • For example, at this layer and head, most of the attention from nearly every position is directed to [SEP].

head_view(attention, tokens, layer=6, heads=[4])
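The same kind of check works for the delimiter pattern: sum each row's mass over the columns holding [CLS] or [SEP]. Again a NumPy sketch on synthetic data; with the real model you would pass a single head's matrix (e.g. attention[6][0, 4].detach().numpy()) together with the tokens list from above:

```python
import numpy as np

def delimiter_attention_fraction(att, tokens, delimiters=("[CLS]", "[SEP]")):
    """Mean attention mass each position assigns to delimiter tokens.
    `att` is (seq_len, seq_len) with rows summing to 1; `tokens` labels the columns."""
    cols = [j for j, t in enumerate(tokens) if t in delimiters]
    return float(att[:, cols].sum(axis=1).mean())

tokens = ["[CLS]", "i", "went", "to", "the", "store", ".", "[SEP]"]
att = np.full((8, 8), 1.0 / 8)  # a uniform head puts 2/8 of its mass on the delimiters
print(delimiter_attention_fraction(att, tokens))  # 0.25
```

A head matching Pattern 6 would score well above this uniform baseline.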